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Abstract 

An algorithm is developed to compute the complete CS decomposition (CSD) of a 
partitioned unitary matrix. Although the existence of the CSD has been recognized 
since 1977, prior algorithms compute only a reduced version (the 2-by-l CSD) that is 
equivalent to two simultaneous singular value decompositions. The algorithm presented 
here computes the complete 2-by-2 CSD, which requires the simultaneous diagonaliza- 
tion of all four blocks of a unitary matrix partitioned into a 2-by-2 block structure. The 
algorithm appears to be the only fully specified algorithm available. The computation 
occurs in two phases. In the first phase, the unitary matrix is reduced to bidiagonal 
block form, as described by Sutton and Edelman. In the second phase, the blocks are si- 
multaneously diagonalized using techniques from bidiagonal SVD algorithms of Golub, 
Kahan, and Demmel. The algorithm has a number of desirable numerical features. 

1 Introduction 

The complete CS decomposition (CSD) applies to any m-by-ra matrix A from the 
unitary group U(m), viewed as a 2-by-2 block matrix, 



X = 



V 


' An 


A12 


m — p 


A21 


A22 



For convenience, we assume q < p and p + q < m. A complete CS decomposition has 
the form 



A = 



Ui 
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s 














Ip-q 
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I-m —p — q 



u 2 

C — diag(cos(0i), . . . , cos(0 q )), S = diag(sin(0i), 



(1.1) 



in which 0, 6 [0, f], Ui G U(p), U 2 e U(m-p), Vx G U(q), and V 2 £ U(m-q). 

The letters CS in the term CS decomposition come from cosine-sine. 

The major contribution of this paper is an algorithm for computing (|1.1|) . We 
believe this to be the only fully specified algorithm available for computing the complete 
CS decomposition. Earlier algorithms compute only a reduced form, the "2-by-l" CSD, 
which is defined in the next section. The algorithm developed in this article is based 
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on the SVD algorithm of Golub and Kahan and has a number of desirable numerical 
properties. 

The algorithm proceeds in two phases. 

1. Phase I: Bidiagonalization. In the special case p = q — ^, the decomposition is 



X = 



Pi 



B (o) 
B. 



(o) 

12 

(«) 



(1.2) 



in which B^ and £>2i are upper bidiagonal, B 12 



^ and B^2 are lower bidiagonal, 



l[? and fl<°> 

and Pi, P2, Qi, and Q2 are q-by-q unitary. We say that the middle factor is a 
real orthogonal matrix in bidiagonal block form. (See Definition 1 1.11 ) 

3(0) n (0) ' 



2. Phase II: Diagonalization. The CSD of 



b\7 B L 



is computed, 



0(0) 
13 11 
0(0 

D 2\ 
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(0) d(0) 
^22 
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Combining the factorizations gives the CSD of X, 
P1U1 



X = 



P2U2 



c s 
-s c 



c s 
-s c 



Q1V1 



Vi 



V2 



2V2 



(1.3) 



Phase I is a finite-time procedure first described in [18], and Phase II is an iterative 
procedure based on ideas from bidiagonal SVD algorithms [BJ [S]. 

Some of the earliest work related to the CSD was completed by Jordan, Davis, 
and Kahan [4] [5] [12] . The CSD as we know it today and the term CS decomposition 
first appeared in a pair of articles by Stewart [161 117] . Computational aspects of the 
2-by-l CSD are considered in O [131 [I4j [T71 [19] and later articles. A "sketch" of an 
algorithm for the complete CSD can be found in a paper by Hari [11| . but few details 
are provided. For general information and more references, see [21 11UI [To] , 

1.1 Complete versus 2-by-l CS decomposition 

Most commonly available CSD algorithms compute what we call the 2-by-l CS de- 
composition of a matrix X with orthonormal columns partitioned into a 2-by-l block 
structure. In the special case p — q — ^, X has the form 



X = 



X11 
X21 



and the CSD is 



X 



' Ui 




c ' 


u * _ 




-s 



A naive algorithm for computing the 2-by-l CSD is to compute two SVD's, 

In = UiCV? 
X21 = (~U2)SVi, 

reordering rows and columns and adjusting signs as necessary to make sure that the 
two occurrences of V* are identical and that C 2 + S 2 = I. This works in theory if no 
two singular values of Xn are repeated, but in practice it works poorly when there are 
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clustered singular values. Still, the basic idea can form the backbone of an effective 
algorithm for the 2-by-l CSD [171 IT5] . 

Unfortunately, many algorithms for the 2-by-l CSD do not extend easily to the 
complete 2-by-2 CSD. The problem is the more extensive sharing of singular vectors 
evident below (still assuming p — q — ^): 

In = UiGV? Xra = U-lSV 2 * 
X21 = (-U 2 )SV* X22 = U 2 CV 2 *. 

All four unitary matrices Ui, U 2 , V\, and V2 play dual roles, providing singular vectors 
for two different blocks of X. Enforcing these identities has proven difficult over the 
years. 

Our algorithm, unlike the naive algorithm, is designed to compute the four SVD's 
in (|1.4|l simultaneously, so that no discrepancies ever arise. 

1.2 Applications 

Unlike existing 2-by-l CSD algorithms, the algorithm developed here fully solves Jor- 
dan's problem of angles between linear subspaces of R n [T2] . If the columns of matrices 
X and Y are orthonormal bases for two subspaces of R n , then the principal angles and 
principal vectors between the subspaces can be computed in terms of the SVD of X T Y 
[15] . The complete CSD, equivalent to four SVD's, simultaneously provides principal 
vectors for these subspaces and their orthogonal complements. 

In addition, our algorithm can be specialized to compute the 2-by-l CSD and hence 
has application to the generalized singular value decomposition. 

1.3 Numerical properties 

The algorithm is designed for numerical stability. All four blocks of the partitioned 
unitary matrix are treated simultaneously and with equal regard, and no cleanup pro- 
cedure is necessary at the end of the algorithm. In addition, a new representation 
for orthogonal matrices with a certain structure guarantees orthogonality, even on a 
floating-point architecture [15] . 

1.4 Efficiency 

As with the SVD algorithm of Golub and Kahan, Phase I (bidiagonalization) is often 
more expensive than Phase II (diagonalization) . For the special case p — q = ^, 
bidiagonalization requires about 2m 3 flops — about |q 3 flops to bidiagonalize each block 
of [ ] and about |g 3 flops to accumulate each of Pi, P2, Qi, and Q2, for a total 

of about 4 (§<j 3 ) + 4 (f ) q 3 = 2m 3 flops. 

1.5 Overview of the algorithm 
1.5.1 Bidiagonal block form 

During Phase I, the input unitary matrix is reduced to bidiagonal block form. A matrix 
in this form is real orthogonal and has a specific sign pattern. Bidiagonal block form 
was independently formulated by Sutton in 2005 [18] . Some similar results appear in 
a 1993 paper by Watkins [2D]. The matrix structure and a related decomposition have 
already been applied to a problem in random matrix theory by Edelman and Sutton 
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Definition 1.1. Given = (0i, . . . , 6 q ) G [0, f ] q and = (0i, . . . , a _i) G [0, f] 9 " 1 , 
let Cj = cos9i, Si — sinOi, c\ = cos0;, and s£ = sin0;. Define Bij(9,<f)), i,j = 1,2, to 
be g-by-g bidiagonal matrices, as follows. 



' Bn (61,0) 


£12(0,0) " 


. B 21 (9,4>) 


£ 22 (0,0) _ 



Ci — Sis'x 




sici 






C2c'i 


— Sq-lSq— 1 
c q C q—1 


C2s'l 


S2C2 


C q S q — 1 S <2 


-Si — Ci.s' 1 




cici 






-s 2 ci 


— C q -lS q _ 1 
— SqC ? _l 


-s 2 si 


c 2 c' 2 


SqSq — 1 c q _ 



(1.5) 



Any matrix of the form 



Bu (61,0) 


£i2(M) " 


521(0,0) 


522(0,0) _ 



is said to be in bidiagonal block form and is necessarily real orthogonal. 

To clarify (|1.5[) , the (q— 1, g— 1) entry of Buifi, 0) is s ? _iCg„i, and the (g— 1, q — 1) 
entry of £22(0, 0) is Cq_ic^_i. Also, if q = 1, then the matrices are defined by 



r s u (0,0) 


5l2(0,0) " 




Cl 


Si 


L S 21 (0,0) 


#22(0,0) _ 




-si 


Cl 



As stated in the definition, any matrix whose entries satisfy the relations of (|1.5|l 
is necessarily real orthogonal. The reverse is true as well — any orthogonal matrix X 
with the bidiagonal structure and sign pattern of (|1,5[1 is expressible in terms of some 
and 0. (This is implicit in [7] [18].) Furthermore, every unitary matrix is equivalent 
to a matrix in bidiagonal block form, as stated in the next theorem. 

Theorem 1.2. Given any m-by-m unitary matrix X and integers p, q such that < 
q < p and p + q < m, there exist matrices P\ £ U(p), P2 G U(m —p), Qi G U(q), and 
Q2 G U(m — q) such that 



Bn(0,0) 


Sl2(0, 0) 














Ip-q 





£21 (0,0) 


£22(0,0) 

















Im — p—q 



for some = (0i,...,0 9 ) G [0, f] 9 and = (0i, . . . , 9 -i) G [0, f] 9-1 . 

A proof of the theorem has already been published in [7, 18; , along with an algorithm 
for computing the decomposition. The algorithm applies pairs of Householder reflectors 
to the left and right of X, causing the structure to evolve as in Fig. Q] This serves as 
Phase I of the CSD algorithm. 

1.5.2 Simultaneous SVD steps 

Phase II of the algorithm simultaneously applies the bidiagonal SVD algorithm of 
Golub and Kahan 6, 8] [10] to each of the four blocks of a matrix in bidiagonal block 
form. 
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Figure 1: Reduction to bidiagonal block form 



The bidiagonal SVD algorithm is an iterative scheme. Given an initial bidiagonal 
matrix B (0 \ the algorithm produces a sequence — > — > B^ —>•■•—> E 
converging to a diagonal matrix of singular values. Implicitly, the step from B^ n ' to 
£}(«•+!) involves a QR factorization of {B^ n ') T —cr 2 I, for some appropriately chosen 
g > 0, but in practice the matrix [B^) T B^ is never explicitly formed. Instead, 
the transformation from B^ n ' to i?( n+1 ) is accomplished through a sequence of Givens 
rotations. The first Givens rotation introduces a "bulge," and the subsequent rotations 
"chase the bulge" away. 

Our algorithm applies this idea simultaneously to all four blocks to execute a CSD 



step. First, two bulges are introduced by a Givens rotation (Fig. 2(a) I, and then the 



bulges are chased away, also by Givens rotations (Fig. 2(b) \. The end result is a new 
matrix in bidiagonal block form whose blocks tend to be closer to diagonal than the 
original blocks. 



1.5.3 The driver routine 

The algorithm as a whole proceeds roughly as follows. 

• Execute Algorithm bidiagonalize to transform X to bidiagonal block form. (See 
Fig. [T). 

• Until convergence, 

— Execute Algorithm cscLstep to apply four simultaneous SVD steps. (See 
Fig. [3) 

The algorithm as a whole is represented by Fig. [3] 

Matrices in bidiagonal block form may be represented implicitly in terms of 9 and 
4> from Definition lf.il In fact, the overall algorithm implicitly represents the sequence 
of Fig. 13(b)] as 



(i) Ai) 



)-(* (2) ,^ (2) ) 



The implicitly represented matrices are exactly orthogonal, even in floating-point. The 
process stops when ( - JV ' is sufficiently close to (0, ... ,0); then the blocks of (|1.5[) are 
diagonal up to machine precision. 
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(a) Bulges arc introduced. 



(b) Bulges are chased. 

Figure 2: CSD step 
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(b) Iteration of the CSD step 
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Figure 3: Computing the complete CSD 
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1.6 Overview of the article 

The remainder of the article is organized as follows. 



Section 


Title 


2 


Phase I: Algorithm bidiagonalize 


3 


Reviewing and extending the SVD step 


4 


Phase II: Algorithm csd_step 


5 


Algorithm csd 


6 


On numerical stability 



The final section contains results of numerical tests on a BLAS/LAPACK-based im- 
plementation, which is available from the author's web site. 



2 Phase I: Algorithm bidiagonalize 

Phase I of the CSD algorithm is to transform the partitioned unitary matrix X to 
bidiagonal block form. 

Specification 2.1. Given an m-by-m unitary matrix X and integers p, q with < q < p 



andp + g < m, bidiagonalize(A,p, q) should compute 6^ = (#i 

0t°> = (^°>,...,^ 1 
Q 2 G U (m — q) such that 



G [0,f]«- 
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Im — p — q . 



Qi 



G U(q), and 



(2.1) 



m which B^j = B tJ (6( \^), i,j = 1,2, are bidiagonal matrices defined in 1)1.50 . 

The algorithm has already appeared in [7] [18]. It is reproduced here. Matlab- 
style indexing is used — A(i,j) refers to the i,j entry of A; A(i : k,j : I) refers to the 
submatrix of A in rows i, . . . ,k and columns j, . . . ,1; A(i : k, :) refers to the submatrix 
of A in rows i, . . . ,k; and so on. Also, house(j:) constructs a Householder reflector F = 
u(I — (3vv*) for which the first entry of Fx is real and nonnegative and the remaining 
entries are zero. (This is an abuse of common terminology — F is not Hermitian if w is 
not real.) If given the empty vector (), house returns an identity matrix. Finally, a, 
Si, c'i, and s' t are shorthand for cos#;' 

Algorithm 2.2 (bidiagonalize). 

Y~X; 

for i := 1, . . . , q 



' sin^ ', cos4>l°\ and sin^>' ', respectively. 



Ui 



U2 := 



ef ] :=atan2(||u2|U|wi||); 



Y(l:p, 1) if i = l 

c\_ x Y(i :p,i) + s'i_xY{i : p, q — 1 + i) if i > 1; 
-Y(p+l:m,l) if i = l 

—c' i _ 1 Y(p + i : m,i) — s' i _ 1 Y(p + i : m, q — 1 + i) if i > 1; 



1 ' I house(wi ' 2 " | house(w2 ) * ' 
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vi := 



—SiY(i, i + 1 : q) — CiY(p + i,i + 1 : qj if i < q 

, if i = q; 

V2 :— SiY(i, q + i : m) + CiY(p + i,q + i : m); 
if i < q, then 

^ :=atan2(|M|,|M|); 

o (l) -=\ h 

else 



end if 



Y :=Y 
end for 



house(^2 ) * 



Pl := P' 1 ' • ■ ■ A M ; P 2 := P 2 (1) • ■ ■ P 2 (q) ; Qi 



Using an LQ factorization, diagonalize the submatrix of Y lying in rows q + 1, . . . ,p 
and p + q + 1, . . . , m and columns 2q + 1, . . . , m and update Q 2 ; 



Theorem 2.3. Algorithm bidiagonalize satisfies Svecification \2. 1\ 
The proof is available in [7J [18] . 

We illustrate the algorithm by concentrating on the case m = 6, p = 3, q — 3. 
In the beginning, the matrix entries can have any signs, 
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To introduce zeros into column 1, two Householder reflectors, based on u\ — Y(l: 
3, 1) and u% = — Y(4:6, 1), respectively, are applied. 
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In fact, because Y is unitary, 
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Next, the algorithm focuses on rows 1 and 4. Now something nonobvious happens. 
Y(l, 2 : 3) and Y(4,2 : 3) are colinear (because Y(:, 1) is orthogonal to F(:,2) and 
Y(:, 3)), as are Y(l, 4 : 6) and Y(4, 4 : 6). (By colinear, we mean that the absolute 
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value of the inner product equals the product of norms.) The row vectors v\ and vi 
are defined so that v\ is colinear with Y (1,2:3) and Y(A, 2:3) and «2 is colinear with 
F(l,4:6) and Y(4, 4:6). Computing <f>^ and multiplying by Householder reflectors 
gives 
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The algorithm proceeds in a similar fashion. Now, Y(2 : 3,2) and Y(2 : 3,4) are 
colinear, as are Y(5:6, 2) and Y(5:6, 4). By computing 0% and applying Householder 
reflectors, we obtain 
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of Householder reflectors gives 
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and so on. Note that the final matrix is represented implicitly by 9\ 

. . . , 4>q-i: so that it is exactly orthogonal, even on a floating-point architecture. 
The following theorem will be useful later. A diagonal signature matrix is a diagonal 
matrix whose diagonal entries are ±1. 

Theorem 2.4. If Bn and B21 are upper bidiagonal, B12 and B22 are lower bidiagonal, 
and 

Bn B\2 
B 2 i B22 

is real orthogonal (but not necessarily having the sign pattern required for bidiagonal 
block form), then there exist diagonal signature matrices Di, D2, Ei, E2 such that 



Dz 



Bn B\2 
B21 B22 



E2 



is a real orthogonal matrix in bidiagonal block form. 



Proof. Run bidiagonalize. Pi, P2, Qi, and Q2 are products of Householder reflectors 
that are in fact diagonal signature matrices because the input matrix already has the 
correct zero/nonzero pattern. Let D\ = Pi, D2 = P2, E\ = Qi, and E2 — Q2- □ □ 



3 Reviewing and extending the SVD step 

In [8] [9], Golub, Kahan, and Reinsch developed a bulge-chasing method for computing 
the SVD of a bidiagonal matrix. This method, as modified by Demmel and Kahan [B], 
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is implemented as one of the most heavily used SVD routines in LAPACK. Phase II of 
our CSD algorithm is based on this SVD method. 

Given a real upper bidiagonal matrix B and a shift a > 0, the SVD step of Golub 
and Kahan applies a unitary equivalence B — S T BT derived from QR iteration on 
B T B — a 2 1. The step is designed so that iteration will drive a superdiagonal entry of 
B to zero (especially quickly if the shifts lie near singular values of B). 

This section reviews the SVD step of Golub and Kahan. There are two notable 
aspects not present in most descriptions. 

• First, a certain left-right symmetry is emphasized — not only is the SVD step 
equivalent to a QR step on B T B — cr 2 I; it is also equivalent to a QR step on 
BB T -a 2 I. 

• Second, the SVD step is extended to handle any number of zeros on the bidiagonal 
band of B. (The original SVD step of Golub and Kahan requires special handling 
as soon as a zero appears. The modification by Demmel and Kahan relaxes this, 
but only for zeros on the main diagonal and only when the shift is a = 0.) The 
modification is necessary in Section U 

3.1 QR step 

The SVD iteration of Golub and Kahan is derived from QR iteration for symmetric 
tridiagonal matrices. 

Given a real symmetric tridiagonal matrix A and a shift A £ R, a single QR step is 
accomplished by computing a QR factorization 

A- XI = QR, (3.1) 

then reversing the factors and putting XI back, 

A := RQ + XI. (3.2) 

Note that the resulting A is symmetric tridiagonal, because RQ = Q T (A — XI)Q is 
upper Hessenberg and symmetric. 

Note that if A is unreduced, i.e., its subdiagonal entries are all nonzero, then Q 
and R are unique up to signs. (Specifically, every QR factorization is of the form 
A — XI = (QD)(DR) for some diagonal signature matrix D.) However, if A has zero 
entries on its subdiagonal, then making the QR factorization unique requires extra 
care. The following definition introduces a "preferred" QR factorization. There are 
two important points: (1) the existence of the forthcoming CSD step relies on the 
uniqueness of the preferred QR factorization, and (2) the handling of noninvertible A 
supports the CSD deflation procedure. Below, the notation A © B refers to the block 
diagonal matrix [ A B ] . 

Definition 3.1. Let A be an m-bj-m real symmetric tridiagonal matrix. Express A 
as 

" A 1 

A 2 

A = 

in which each Ai is either an unreduced tridiagonal matrix or 1-by-l. A preferred QR 
factorization for A is a QR factorization with a special form. There are two cases. 
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Case 1: A is invertible. Then a preferred QR factorization has the form 
Qi 



A 



<?2 



Qr 



Rl 



R 2 



Rr 



and satisfies the following conditions: 
1. 



every Qi has the same number of rows and columns as Ai, is real orthogonal 
and upper Hessenberg, and has positive subdiagonal and determinant one 
(unless it is 1-by-l, in which case it equals the scalar 1), and 
2. every Ri has the same number of rows and columns as Ai and is upper 
triangular. 

Case 2: A is not invertible. Define Qi, . . . , Q r and R\, . . . , R r as in the first case, 
let k be the least index identifying a noninvertible A k , and let I be the index of 
this block's last row and column in the overall matrix A. (Note that the first zero 
diagonal entry of Ri ffi • ■ • ffi Rr must be in position (I, I).) Also, let 

h-i 
±1 

Im-l 

with the sign chosen so that det(P) = 1. A preferred QR factorization for A is A 



)A r 



QR with Q = (Qi0- • ■®Q k ®I m -i)P and R = P 1 {Ri®- ■ -®R k ®A k+1 ® 

Theorem 3.2. The terminology is valid: every "preferred QR factorization" really is 
a QR factorization. 

Theorem 3.3. If A is a real symmetric tridiagonal matrix, then a preferred QR fac- 
torization A — QR exists and is unique. 

The proofs are straightforward. 

Theorem 3.4. Apply the QR step \3.1\) - \3.ty) to a real symmetric tridiagonal matrix A 
using the preferred QR factorization. If the shift X is an eigenvalue of A, then deflation 
occurs immediately: the resulting A has the form 



* 






A 



Proof. A — XI is not invertible, so its preferred QR factorization satisfies Case 2 of 
Definition 13. II The rest of the proof uses the notation of that definition. The last row 
of Rk contains all zeros, so row I of Ri ®- ■ -®R k ® (^4fc+i — XI)®- ■ -®(A r — XI) contains 
all zeros. The permutation matrix P is designed to slide this row of all zeros to the 
bottom of R, and hence the last row of RQ also contains all zeros. By symmetry, the 
last column of RQ also contains only zeros, and so RQ is block diagonal with a 1-by-l 
block equaling zero in the bottom-right corner. Adding XI to produce A — RQ + XI 
makes the bottom-right entry equal X. □ □ 

The orthogonal factor Q in a preferred QR factorization can be expressed as a 
product of Givens rotations in a particularly simple way. A 2-by-2 Givens rotation 
with an angle of 8 is a matrix ~ s ] in which c = cos# and s = sin#. More generally, 
an m-by-m Givens rotation is an m-by-m real orthogonal matrix with a 2-by-2 principal 
submatrix that is a Givens rotation and whose principal submatrix lying in the other 
rows and columns is the (m — 2)-by-(m — 2) identity matrix. 
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Theorem 3.5. The orthogonal factor Q in a preferred QR factorization of an m-by- 
m real symmetric tridiagonal matrix can be expressed uniquely as a product of Givens 
rotations Gi • ■ ■ G m -i in which Gj rotates columns j and j + 1 through an angle in 

[0,7V). 

Proof. Existence is proved by Algorithm 13.71 below. Uniqueness is guaranteed by the 
upper Hessenberg structure: inductively, the (j + l,j) entry of Q determines Gj. □ 

□ 

Algorithm 3.6 (givens). This algorithm computes a Givens rotation, with a corner 
case defined in a particularly important way. 

1. Given a nonzero 2-by-l vector x, givens(m, ii,iz,x) constructs an m-by-m Givens 
rotation whose submatrix lying in rows and columns i\ and 22 is a 2-by-2 Givens 
rotation G = \° s ~ s ] with angle in [0, it) such that G T x = (±||z||, 0) T . 

2. givens(m, ii, 12, (0, 0) T ) is defined to equal givens(m, ii, 12, (0, 1) T ) — it rotates 
through an angle of ~. 

The choice to rotate through ^ when x — (0, 0) T allows the following algorithm to 
handle Cases 1 and 2 of the preferred QR factorization in a uniform way. 

Algorithm 3.7 (qr_step). Given an m-by-m real symmetric tridiagonal matrix A 
and a shift A £ R, the following algorithm performs one QR step. See Theorem 13.81 It 
is an extension of the idea on pp. 418-420 of [ID] . 

A := A - XI; 

for i = 1, . . . , m — 1 

I A(i : i + 1, i) if i = 1 or A(i : i + 1, i — 1) is the zero vector 

v = < _ 

I A(i : i + 1, i — 1) otherwise; 

d := givens(m, i, i + l,v); 

A := Gj Ad; 
end for 
A ■- A + XI- 

Theorem 3.8. Run Algorithm \3. 7| to compute Gi, . . . , G m -i and A, and define Q — 
Gl • • • Gm — l and R = Q T (A — XI). Then A — XI = QR is a preferred QR factorization 
and 13. 1\) and f3. are satisfied. 

Proof. The proof is broken into two cases. 

Case 1: A is not an eigenvalue of A. The proof is by induction on r, the number 
of unreduced blocks of A. 

The base case is r = 1. In this case, the algorithm executes the usual bulge- 
chasing QR step. See [TO] . 

For the inductive step, suppose r > 1 and assume that the theorem has been 
proved for smaller values of r. Let s be the number of rows and columns in the 
final block A r . The first m — s — 1 executions of the loop neither observe nor 
modify the final s rows or columns of A and by induction compute Qi, . . . , Q T — l 
and R\, . . . , R r -i of the preferred QR factorization. At the beginning of the loop 
with i — m — s, the (m — s, m — s — l)-entry of A is nonzero because A — XI is 
invertible, and the (m — s+l,m — s— l)-entry of A is zero, so G m -r is set to 
an identity matrix. When the loop continues with i = m — s + 1, the algorithm 
proceeds as in [TO] . 
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Case 2: A is an eigenvalue of A. Let I be the index of the first column of A that is 
a linear combination of the earlier columns. The algorithm proceeds as in Case 
1 until the loop with i = I. At that point, the leading (/ — l)-by-(Z — 1) principal 
submatrix of the new tridiagonal matrix is fixed (ignoring the addition of XI at 
the very end), and considering the proof of Theorem 13.41 the ith row of A is all 
zeros. Therefore, Gi is a rotation by \, which has the effect of swapping rows 
I and I + 1 of A. Now, the (I + l)st row of A is all zeros, so by induction, all 
remaining Givens rotations have angle -| and the row of zeros is pushed to the 
bottom of A. This constructs Q and R as in Case 2 of Definition [37T] □ 

□ 

3.2 SVD step 

The SVD algorithm of Golub and Kahan starts with the idea of applying the QR step 
to B T B — a 2 1. Because the formation of B T B is problematic in floating-point, the QR 
step must be executed implicitly, working directly with the entries of B. The following 
definition of the SVD step is unconventional but equivalent to the usual definition. 

Definition 3.9. Let B be a real bidiagonal matrix (either upper bidiagonal or lower 
bidiagonal) and a a nonnegative real number. Matrices B, S, and T are obtained from 
an SVD step if 

BB T - a 2 1 = SL T and B T B - a 2 1 = TR 
are preferred QR factorizations and 

B = S T BT. 

Note that 

BB T — L T S + a 2 I and B T B = RT + o 2 I, 
i.e., an SVD step on B implicitly computes QR steps for BB T and B T B. 
Theorem 3.10. The SVD step exists and is uniquely defined. 

Proof. This follows immediately from the existence and uniqueness of the preferred 
QR factorization. □ □ 

Theorem 3.11. If B is upper bidiagonal, then B is upper bidiagonal. If B is lower 
bidiagonal, then B is lower bidiagonal. 

The proof is presented after Lemma 13.171 below. 

Before going any further with the SVD step, we need a utility routine. 
Algorithm 3.12 (bulge_start). Given a 2-by-l vector x and a shift a > 0, bulge_start 

/ 2 2 \ T 2 2 \ 7 

computes a vector colinear with \X\ — o ,xiX2) ■ If (iei — a ,xiX2) is the zero vector, 
then bulge_start returns the zero vector. 

The implementation of bulge_start is omitted. LAPACK's DBDSQR provides 
guidance pQ. 

The following algorithm computes an SVD step for an upper bidiagonal matrix. It 
can also handle lower bidiagonal matrices by taking transposes as appropriate. 

Algorithm 3.13 (svd_step). Given an m-by-m upper bidiagonal matrix B and a 
shift a > 0, the following algorithm computes one SVD step. See Theorem 13. 141 
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B := B; 

for i := 1, . . . , m — 1 

(bulge_start(B(i, i : i + 1) T , cr) if i = 1 or B(i - 1, i : i + 1) = (0, 0) 

I _B(i — 1, i : i + 1) otherwise; 
Ti := givensfm, i,i + l,v); 
B := BT Z ; 

fbulgej3tart(B(i : i + 1, i + 1), a) if B(i : i + = (0,0) T 
U := < _ 

I £>(i : i + 1, i) otherwise; 
Si := givens(m, i, i + 1, w); 
B := Sj S; 

end for 

S := oi • • • i5 m _i; T :— Ti ■ ■ ■ T m -i; 

Theorem 3.14. Given a real upper bidiagonal matrix B and a shift a > 0, 

1. Alaorithm \3.13\ comvutes an SVD step. 

2. B is real upper bidiagonal. 

The proof follows immediately from Theorem 13.81 and the following three lemmas. 

Lemma 3.15. Upon completion of line^ B is upper bidiagonal, with the possible 
exception of a "bulge" at (i + l,i). Upon completion of line [TI\ B is upper bidiagonal, 
with the possible exception of a "bulge" at (i, i + 2) if i < m — 1. Upon completion of 
the entire algorithm, B is upper bidiagonal. 

The proof is omitted; the bulge-chasing nature of the algorithm is standard. 

Lemma 3.16. Let B be an m-by-m real upper bidiagonal matrix and a a nonnegative 
real number. Run A laorithm \3.13\ to produce Ti, . . . ,T m -i, and run Algorithm \3. 7| with 
A = B T B and A = a 2 to produce Gi, . . . , G m -i ■ Then Ti — Gi for i = 1, . . . , m — 1. 

The proof can be found in Appendix 1X1 

Lemma 3.17. Let B be an m-by-m real upper bidiagonal matrix and a a nonnegative 
real number. Run Alaorithm \3.13\ to produce Si, . . . , S m -i, and run Algorithm \3. 7| with 
A = BB T and A = a 2 to produce Gi, . . . , G m -i- Then Si = Gi for i = 1, . . . , m — 1. 

The proof can be found in Appendix lAl 

Proof of Theorem \3.11\ \i B is upper bidiagonal, then B can be obtained from Algorithm 
13.131 which produces upper bidiagonal matrices. If B is lower bidiagonal, then apply 
the same argument to B T . □ 

4 Phase II: Algorithm cscLstep 

Now we can return to the CSD algorithm. Phase I, which was already seen, transforms 
the original partitioned unitary matrix to bidiagonal block form. Phase II, which is 
developed now, iteratively applies the SVD step to each of the four blocks of this 
matrix. Algorithm csd_step executes a single step in the iteration. 
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4.1 Existence of the CSD step 

be a real orthogonal matrix in bidiagonal block 



Definition 4.1. Let 



B21 B22 

form and \i and v nonnegative shifts satisfying /i 2 + v 2 = 1. The matrix equation 



Bu 


B\2 




' Di 




' Si 


T ■ 


B21 


B22 








S 2 _ 







B12 




' Ti 






B21 


B22 




T 2 _ 




£ 2 _ 



effects a CSD step if 



B u i-> Si fluTi and B22 

?T 1 



is a matrix in bidiagonal block form. 



S2 B22T2 are SVD steps with shift fi, 

B12 1 ^ S[B 12 T 2 and B 2 i >-> Sj B21T1 are SVD steps with shift i/, 

Di, D2, £1, and -E2 are diagonal signature matrices, and 

Bu B\2 
B21 B22 

The existence of the CSD step is not obvious at first glance. It depends on several 
SVD steps being related in very specific ways, e.g., Bu and B21 having the same right 
orthogonal factor Ti. The following theorem establishes that yes, indeed, the CSD step 
exists, and its proof makes clear the necessity of the restriction /1 2 + v 2 — 1. 

Theorem 4.2. The CSD step exists and the result 



Bu 
B21 



B\2 

B22 



is uniquely defined. 



Proof. Begin with the identities 








B uBu 


2 T 

- n I = 


— (B2I-B2I 


-v 2 I), 


B22 B22 


2 r 


— (-B12-B12 


-u 2 I), 


BuBu 


2 r 

— n 1 = 


— (B12B12 


-u 2 I), 


B22B22 


2 T 


— (B21B21 


-u 2 I). 



(4.1) 
(4.2) 
(4.3) 
(4.4) 

Each is proved using orthogonality and the relation /i 2 + v 2 = 1. For example, the 
orthogonality of [ B xl ] i m ph es BuBu + 5^521 = /, and splitting the right-hand- 
side / into fi 2 I + v 2 I and rearranging gives (14.11) . 

Define Bij — SijBijTij by an SVD step with the appropriate shift (fi if i = j or v 
if i 7^ j). This produces a total of eight orthogonal factors Sij, Tij, but only four are 
required for the CSD step. In fact, by uniqueness of the preferred QR factorization, 
T11 = T21, T12 = T22, Su = S12, and S21 = S22- (For example, Tu is the orthogonal 
factor in the preferred QR factorization BuBu — fi 2 I = TuRu, and T21 is the or- 
thogonal factor in the preferred QR factorization B21B21 — v 2 l = T^iifei- Considering 
(|4.ip and the uniqueness of the preferred QR factorization, we must have T21 — Tu 
and R21 = ~Ru.) Hence, it is legal to define Ti = Tu = T21, T 2 = T12 = T22, 
Si = Su = S12, and S2 = S21 = S22- 

Finally, Di, D2, E\, and E2 are designed to fix the sign pattern required for a 
matrix in bidiagonal block form. Their existence is guaranteed by Theorem 12.41 

Regarding uniqueness, Si, S2, Ti, and T2 are uniquely defined by the preferred 
QR factorization, so SfBijTj, i,j = 1,2, are uniquely defined. This uniquely defines 
the absolute values of the entries of Bij, i,j — 1,2, and the signs are specified by the 
definition of bidiagonal block form. □ □ 
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The obvious way to compute a CSD step is to compute four SVD steps and then 
to combine them together as in the proof. Of course, when working in floating-point, 
the identities such as Tn = T21 typically will not hold exactly. In fact, when singular 
values are clustered, the computed Tn and T21 may not even bear a close resemblance. 

Our approach is to interleave the computation of the four SVD steps, taking care 
to compute Si, S2, Ti, and T2 once and only once. We find that when one block of a 
matrix provides unreliable information (specifically, a very short vector whose direction 
is required for a Givens rotation), another block may come to the rescue, providing 
more reliable information. Hence, the redundancy in the partitioned orthogonal matrix, 
rather than being a stumbling block, is actually an aid to stability. 



4.2 Algorithm specification 

The following is a specification for Algorithm csd_step, which accomplishes one step 
in Phase II of the CSD algorithm. 

Specification 4.3. The input to csd_step should consist of 

1. 6> (n) G [0, § ] q and (j> (n) G [0, f implicitly defining a matrix in bidiagonal block 



form 



„(«) 



2. integers 1 < i < i < q identifying the current block in a partially deflated matrix- 

,(n) „(»)' 

must have the form 



* 











* 
















5) 



* 





■°12 


(i 


i,i : i) 



* 


* 











* 












a 21 




5) 



* 





^22 


(i 


i,i : i) 






(4.5) 



3. shifts < fx, v < 1 satisfying fi 2 + v 1 = 1. 
The algorithm should compute one CSD step with shifts and v to effect 





B["\i : i,i: i) 


n 12 


(i 




1 — >■ 


" B11 


B12 




B^(i : i) 


r> 22 


(1 






B21 


B22 


The output should consist of 9 (n ~ 




[0, 


f ] 9 and 


^ (n+1) G [0, 


2J ' 



,(n+l) „(" + !) 



(4.6) 



that results from replacing the appropriate submatri- 



ing the matrix 

ces from ((43} with the corresponding submatrices of (|4.6[) . and orthogonal matrices 



(n+l) !/("+!) v< n+1 ) 



, C 2 in+l; such that 



R(n+1) 

R («+l) 
• D 21 



R(n+1) 
D 12 
r(«+1) 
"22 



(n+l) 



V. 



(n+l) 



sir b} 2 ' 

flW b 



(") 



(n+l) 



c 



(n+l) 



(4.7) 
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4.3 The algorithm 

The naive idea for implementing the above specification is to compute four SVD steps 
separately. As mentioned in the proof of Theorem 14.21 this produces eight orthogonal 
factors when only four are needed. In fact, in light of (14.1[) - (|4.4[) and Theorem 13.51 
the Givens rotations defining Si, 52,21, and T2 come in identical pairs. Of course, 
in floating-point, the paired Givens rotations would not actually be identical. Hence, 
the naive algorithm suffers from the standpoints of efficiency (each Givens rotation is 
needlessly computed twice) and stability (what happens when two computed Givens 
rotations disagree?). 

Our solution, suggested by Fig. [2j is to execute the four SVD steps simultaneously, 
computing each Givens rotation once and only once through a bulge-chasing procedure 
that treats all four blocks with equal regard. 

The algorithm makes use of a routine called merge. In the absence of roundoff 
error, merge is essentially a no-op; given two vectors in any one-dimensional subspace, 
it returns a vector in the same subspace. (And merge((0, 0) T , (0,0) T ) = (0,0) T .) In 
the presence of roundoff error, merge is used to ameliorate small differences resulting 
from previous roundoff errors. 

Algorithm 4.4 (csd_step). This algorithm performs one CSD step on a matrix in 
bidiagonal block form and satisfies Specification 14.31 Its correctness is established in 
Theorem 14.51 



V11 ■= bulge_start(£?ii (i, i : i + 1) T , ^1); 
V21 :— bulge_start (B21 : i + 1) T , v)\ 
V! := merge^n,^!); 
Ti,i := givens ( q, + l,Vi); 

[-§11 S 12 ] ._ [Sn B 12 ] \T 1 _i ]. 
L-B21 -B 22 J • [B 21 B 22 \ [ I]' 



Explicitly construct [ 



B11 b 12 
S21 B22 



] from {6 {n \<t> {n) ); 





1 bulge_start(_Bn (i : i + 1, i + 1), n) 
bulge_start (B12 (i : i + 1 , i) , v) ; 



if nonzero 



otherwise; 



M12 : 




1 bulge_start(B2i [i : i + 1, i + 1), v) otherwise; 

M22 := bulge_start(i?22(i : i+ l,i),/i); 

ui := merge(un, U12); u 2 := merge(it 2 i, U22); 

Si :i ■— givens(g, i, i + l,Ui); S 2 ,i := givens(g, i, i + 1,1*2); 



if nonzero 




for i := i + 1, . . . , i 



- 1 




bulge_start(£?2i(i, i i + otherwise; 



bulge_start(Bii(z, i : i + l),/i) otherwise; 



if nonzero 



if nonzero 
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12 



V-22 



{Bi2(i — l,i — 1 : l) if nonzero 

bulge_start(Bi2(i, i — 1 :i),v) otherwise; 
J Bnii — l,i — 1 : l) if nonzero 

1 bulge_start(i?22(i, i — 1 : otherwise; 
ui := merge(?;ii,i;2i); v 2 := merged, 1^2); 
Ti,, := givens(q, i, i + 1, t>i); T 2 , 4 -i := givens(q, £ - 1, i, w 2 ) 

[ill S 12 ] . = [B„ S 12 ] \T lti 
[B21 B 22 \ ■ [B 21 B 22 \ [ T 2i _ 1 



Mil := 



U12 



{Bn (i : i + 1, i) if nonzero 

bulge_start(Bn (i : i + l,i + 1), fi) otherwise; 

B\2(i '■ i + l,i — 1) if nonzero 

bulge_start(_Bi2(i : i + 1, i), v) otherwise; 

{B 2 i(i : i + 1, i) if nonzero 

bulge_start(f? 2 i(i :i + l,i + l),v) otherwise; 
J B22(i : i + 1, i — 1) if nonzero 

1 bulge_start(_B22(i : i + 1, i), ^1) otherwise; 
mi := merge(un, U12); u 2 := raerge(u 2 i, U22); 
S\,i ■— givens(g, i,i + l,ui); S 2 ,i := givens(g, i, i + 1, u 2 ); 

[ill -B12] . = [" S M 1 T [Bh B 12 

[B 21 B 22 J ' [ Sa -' J L-B21 -§22 



^22 



end for 



V12 := 



«22 := 



{B\2(i — l,i — 1 : i) if nonzero 

bulge_start(Bi2(i, i — 1 : i), otherwise; 

{5 22 (i — 1, i — 1 : i) if nonzero 

bulge_start(i?22(j, i — 1 : i),fJ>) otherwise; 
V2 := merge(tii2, V22); 
T 2,l-i : = givens(gj- l,i,v 2 ); 
[Su -B12] . = [Bu S12] [/ 1. 

LS21 -B22 J ' L-§21 S 22 J L J 2,,-l J ' 

r r(n + l) ._ Q Q . 7-7-(ll + l) Q Q 

U l "1,7-1) U 2 ■— "2,1 • ' ■ <3 2 J-1> 

^(n+1) . = Ti( . . . . j,^^ ^n + 1) ;= Ta ^ _ , 

Fix signs in [f^ §^ ] to match JT3J), negating columns of f/ 1 (n+1) , C/^ n+1) , V; ( 
and V 2 ' n+1 ' as necessary; 

Compute (6 {n+1 \(j> {n+1) ) to implicitly represent [f^ 

Note that it is possible and preferable to enforce signs and compute Q( n+1 ^ and 
if > ( n + 1 ) as Givens rotations are applied, instead of waiting until the end of the algorithm. 

Theorem 4.5. If arithmetic operations are performed exactly, then Algorithm csd_step 
satisfies Specification^ 



(n+l) 



Proof. The proof is organized into three parts. First, it is shown that the algorithm 
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computes SVD steps for the top-left and bottom-right blocks. Then, it is shown that 
the algorithm computes SVD steps for the top-right and bottom-left blocks. Finally, 
utilizing the proof of Theorem 14.21 it is shown that the algorithm as stated simulta- 
neously computes SVD steps of all four blocks and hence a CSD step for the entire 
matrix. 

For the first part, temporarily delete lines [6l [TTl [T2l [2T1 [221 [29l [30l [381 [471 and [491 

and make the following replacements: 



line original replacement 



m 


Ml 


:= merge(i;ii,t;2i); 






Ml 


:= «n; 






M 


Ml 


:= merge(un, mi 2 ): 


U2 


■— merge(?t2i, M22); 


Ml 


:= Mn; 


U2 


■— U22; 


M 


Ml 


:= merge(i;ii, V21); 


V2 : 


= merge(ui2,U22); 


Ml 


:= M11; 


V2 : 


= V22; 




Ml 


:= merge(Mii,Mi 2 ): 


U-2 


:= merge(it2i,M2 2 ); 


Ml 


:= Uu; 


U2 


■= U22; 


m 


M2 


:= merge(wi2,W22); 






M2 


:= V22; 







All references to the top-right and bottom-left blocks have been removed, and the 
resulting algorithm is equivalent to running Algorithm svd_step on the top-left and 
bottom-right blocks. Hence, U{ n+1 ^ and are the outer factors from an SVD 

step on the top-left block, and i7 2 n+1 ' ) and V 2 ( - n+1 - ) are the outer factors from an SVD 
step on the bottom-right block. 



For the second part, revert back to the original algorithm but make the following 
changes, intended to focus attention on the top-right and bottom-left blocks: Delete 
lines [5] [TOl [T3l [201 [23l [28l [3T1 l39l l47l and l49l and make the following replacements: 



line original replacement 



m 


VI 


:= merge(i>ii, 


M21); 






Ml 


:= M21; 






M 


Ml 


:= merge(un 


, M12) 


; U2 


:= merge(it2i, M22); 


Ml 


:= M12; 


M2 


■— U21] 


M 


Ml 


:= merge(t;ii, 


M21); 


V2 : 


= merge(t)i2,U22); 


Ml 


:= m 2 i; 


v 2 : 


= Vl2\ 




Ml 


:= merge(Mn 


,Ml 2 ) 


; «2 


:= merge(M 2 i,M22); 


Ml 


:= M12; 


U2 


■— U21; 


m 


M2 


:= merge(«i2, 


M22); 






M2 


:= M12; 







This time, the algorithm is equivalent to running Algorithm svd_step on the top-right 
and bottom-left blocks, and so £/} n+1 ' and V 2 ' n+1 ' are the outer factors from an SVD 
step on the top-right block and U^ 1 ^ and V{ n+ are the outer factors from an SVD 
step on the bottom-left block. 



By the existence of the CSD step, the two versions of the algorithm discussed above 
produce identical U[ n+l) , U^ n+1) , V 1 (n+1) , and V 2 (n+1) . Hence, by Theorem [531 the two 
versions produce identical Givens rotations. Therefore, in each invocation of merge in 
the final algorithm, one of the following holds: both vectors are nonzero and colinear, 
or one vector is the zero vector and the other vector points along the second coordinate 
axis, or both vectors equal the zero vector. In any case, the call to merge is well 
defined. Therefore, the final algorithm simultaneously computes SVD steps for all four 
blocks. 

The final two lines of code could be implemented (inefficiently) with a single call to 
bidiagonalize. It is straightforward to check that this would only change the signs of 
entries and would compute 6» (n+1) and (n+1) . □ □ 

5 Algorithm csd 

Algorithm csd is the driver algorithm, responsible for initiating Phase I and Phase II. 

Given an m-by-m unitary matrix X and integers p, q with < q < p and p + q < m, 
csd(X, p,q) should compute 9 = (0i,...,0 9 ) G [0, f ] q , Ui G U(p), U2 G U(m - p), 



Computing the Complete CS Decomposition, Brian D. Sutton 20 



Vi € U(q), and V2 £ U(m — q) satisfying 
" Ui 



X 



u 2 



c 


s 














Ip-q 





-s 


c 

















IjTl — p — q 



v 2 



with C — diag(cos(#i ),..., cos(9 q )) and S = diag(sin(#i ),..., sm(6 q )). It would be 
preferable to replace the approximate equality with an error bound, but a formal 
stability analysis is left to the future. 

csd's responsibility is to invoke bidiagonalize once and then csd_step repeatedly. 
The following code fleshes out this outline. 

Algorithm 5.1 (csd). Given X, p, and q, the following algorithm computes the 
complete CS decomposition in terms of 6, Ui, U2, Vi, and Vi. 

Find (o) , (o) , Pi, P 2 , Qi, and Q 2 with bidiagonalize; 

If any Of 1 ^ or </>- ' is negligibly different from or |-, then round; 

n := 0; 

Set i and i as in (|5.1[) : 
while i > 1 

if some fl ' n ' 



, i < i < i, is negligibly far from -|, then 
/1 := 0; v := 1; 
elseif some ° { " 

else 

Select shifts /1 and satisfying /1 2 + v 2 = 1; (* see discussion 
end if 



, , i < i < i, is negligibly far from 0, then 
1; v := 0; 



Compute 



(n+l) ^("+!) 77_( n+1 ) i/, (n+1) i/i n+1) 



with csd_step; 



n := n + 1 

q(n) 



If any (9^ n; or is negligibly different from or -| , then round; 
Set i and i as in (15.111: 



end while 



tfi := Pi((Ul L, U^' ■ ■ ■ U^) © /„_,); U 2 := P 2 ((^^^ ■ ■ ■ 0™) © /m-„- 9 ); 
ft := Ql(V 1 (1 V 1 C2) ■ ■ • VW); V 2 := Q 2 ((V 2 (1 V 2 (2) • ■ ■ V^) © / ra _ 2 ,); 

As promised, the algorithm calls bidiagonalize once and csd_step repeatedly. The 
subtleties are the deflation procedure and the choice of shifts. 



Deflation If the matrix obtained after n CSD steps, 



R (n) R (n) 
"11 D 

B, 



(n) 



12 
g(n) 



B 2 i(e in) ,^) 



B 12 (e {n) A {n) ) 

B2 2 (0 (n) ,0 ( " ) ) 
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has the form (|4.5[1 . then the next CSD step can focus on just a principal submatrix. 
The indices i and i are determined by 0'™' as follows. 

< none of 0[ n) , . . . , ^ equal (5.1) 
s i is as small as possible 

Hence, deflation occurs when some (f)^ transitions from nonzero to zero. 

Note that some 9^ may equal or ^ or some may equal producing zero 
entries in the matrix without satisfying (|5.ip . Fortunately, in this case one of the blocks 
has a zero on its diagonal, the appropriate shift is set to zero, and deflation occurs in 
the one block just as in the bidiagonal SVD algorithm of Demmel and Kahan 6 . (This 
is the reason for Case 2 of the preferred QR factorization.) It is easy to check that 
then one of the (p\ n+1 ^ becomes zero. Hence, as soon as a zero appears anywhere in 
any of the four bidiagonal bands, the entire matrix in bidiagonal block form deflates 
in at most one more step (assuming exact arithmetic). 

Choice of shift There are two natural possibilities for choosing shifts in line 1151 

• One possibility is to let [i or v be a Wilkinson shift (the smaller singular value 
of the trailing 2-by-2 submatrix of one of the blocks) from which the other shift 
follows by n 2 + v 2 = 1. This seems to give preference to one block over the other 
three, but as mentioned above, as soon as one block attains a zero, the other 
three blocks follow immediately. 

• Another possibility is to let fi and v be singular values of appropriate blocks, i.e., 
"perfect shifts." This keeps the number of CSD steps as small as possible [101 
p. 417] at the cost of extra singular value computations. 

Based on what is known about the SVD problem and limited testing with the CSD algo- 
rithm, Wilkinson shifts appear to be the better choice, but more real world experience 
is desirable. 

6 On numerical stability 

So far, the input matrix X has been assumed exactly unitary, and exact arithmetic has 
been assumed as well. What happens in a more realistic environment? The algorithm 
is designed for numerical stability. All four blocks are considered simultaneously and 
with equal regard. Nearly every computation is based on information from two different 
sources, from which the algorithm can choose the more reliable source. 

Below, some numerical issues and strategies are addressed. Then results from nu- 
merical experiments are presented. A BLAS/LAPACK-based implementation is avail- 
able from the author's web site for further testing. 

6.1 Numerical issues and strategies 

The better of two vectors in bidiagonalize. As mentioned in the discussion 
after Algorithm 12.21 most of the Householder reflectors used in the bidiagonalization 
procedure are determined by pairs of colinear vectors. If the input matrix X is not 
exactly unitary, then some of the pairs of vectors will not be exactly colinear. Each of 
the computations for Ui, 112, Vi, and 172 in Algorithm I2.2I performs an averaging of two 
different vectors in which the vector of greater norm is weighted more heavily. (Vectors 
of greater norm tend to provide more reliable information about direction.) 
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Implementation of bulge_start. The implementation of bulge_start (Algo- 
rithm requires care to minimize roundoff error. LAPACK's DBDSQR provides 
guidance Q]. 

Implementation of merge. In csd_step, most Givens rotations can be con- 
structed from either (or both) of two colinear vectors. The function merge in the 
pseudocode is shorthand for a somewhat more complicated procedure in our imple- 
mentation: 

• If the Givens rotation is chasing an existing bulge in one block and introducing 
a new bulge in the other block, then base the Givens rotation entirely on the 
existing bulge. (Maintaining bidiagonal block form is crucial.) 

• If the Givens rotation is chasing existing bulges in both blocks, then take a 
weighted average of the two vectors in the call to merge, waiting the longer 
vector more heavily. (Vectors of greater norm provide more reliable information 
about direction.) 

• If the Givens rotation is introducing new bulges into both blocks, then base the 
computation solely on the block associated with the smaller shift, either /i or 
v. (In particular, when one shift is zero, this strategy avoids roundoff error in 
bulge_start.) 

Representation of ffW and Angles of and £ play a special role in the 

deflation procedure. Because common floating-point architectures represent angles near 
more precisely than angles near -| , it may be advisable to store any angle ip as a pair 
(ijj, -| — tjj). This may provide a minor improvement but appears to be optional. 

Rounding of flW and c/)( n ' in csd. To encourage fast termination, some angles 
in 0^ n ' and <fy- n ' may need to be rounded when they are negligibly far from or -|. The 
best test for negligibility will be the subject of future work. 

Disagreement of singular values when using perfect shifts. If in csd, 

the shifts /i and u are chosen to be perfect shifts, i.e., singular values of appropriate 
blocks, then the computed shifts may not satisfy /i 2 + u 2 — 1 exactly in the presence 
of roundoff error. Empirically, satisfying /i 2 + v 2 = 1 appears to be crucial. Either 
fi or v should be set to a singular value no greater than and then the other shift 
computed from /i 2 + v 2 = 1. 



6.2 Results of numerical experiments 



The sharing of singular vectors in (|1.4[) is often seen as a hindrance to numerical com- 
putation. But in our algorithm, the redundancy appears to be a source of robustness — 
when one block provides questionable information, a neighboring block may provide 
more reliable information. Numerical tests support this claim. 
Our criteria for a stable CSD computation, 



X 



Ui 



c 


s 














Ip-q 





-s 


c 

















Ira — p — q 



v 2 



are the following: If X is nearly unitary, 

\\X*X -I m \\ 2 = e, 



(6.1) 
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then we desire 8 — (Ox, ... , 8 q ) and Ux, U2, Vx, V% such that 

C = diag(cos(#i), . . . , cos(9 q )) S = diag(sin(#i), . . . , sin(0 4 )) 

\\UtUx - I P \\ 2 & e \\US U 2 - I m - P h « e 
- /„|| a « e H^Fa - / ra _,|| a « e 



C 


s 



f/f(Xu+£u)T/i, ||J5u||a«e 



C 

J„ 



= l/s f(X 2 i + £ 2 i )Vx, \\E 2 xh^e 



— U2 (X22 + Saa)V2, H-B22H2 ~ e. 



Van Loan's example. 

\M. Let 



Our first test case is based on an example of Van Loan 



Xxx 



Xx2 



X2I 



X22 = 



0.220508860423 
0.075149984350 
0.346099513974 
0.200314808251 

0.123868614848 
0.505660921957 
-0.068044487719 
-0.339461927716 

-0.149903307775 
-0.132593956233 
0.631588073183 
-0.588949720476 

-0.211288905103 
-0.422173038686 
-0.473064671229 
-0.403033356444 



-0.114095899416 
0.552192330457 

-0.465523358094 
0.015869922033 

-0.424487382687 
0.028765021298 
-0.292950312278 
-0.307319405113 

0.456869095895 
0.403919514293 
0.226164206817 
-0.205112923304 

-0.065095708488 
-0.565182436669 
0.502284642254 
0.250518329548 



0.001410518052 
0.309420137864 
-0.147474170901 
0.063768831702 

0.756283107266 
-0.138696588123 
-0.202722377746 
-0.530848659627 

-0.814555019070 
0.374067025998 
0.132173742848 
0.239887841318 

0.064582503584 
0.079260723473 
0.218767959397 
0.166101999167 



0.309131888087 
0.519525649668 
0.284504924779 
0.364621650530 

-0.274401793502 
0.219160328651 
0.655183291894 

-0.575436177767 

0.205461483909 
-0.294979263882 
0.047014825861 
0.537774110108 

0.100169729053 
0.297296887111 
0.079539299401 
0.107399029584 



Xi 
.x 2 



and let X = \ v xl Y 12 1 . Van Loan considered the submatrix 

LA21 A22 j 

X satisfies (|6.1[) with e ~ 3.4 x 1CP 12 . Our implementation produces 6, U\, 
U 2 , Vx, axidV 2 for which \\U?Ui-I 4 \\ 2 « 1.2xl0- 15 , \\U^U 2 -I 4 \\ 2 « 1.4xl0" 15 , 
\\VfVx - / 4 || 2 » 5.8 x 10- 16 , \\V 2 T V 2 - h\\ 2 « 1.7 x lO" 15 , ||S n || 2 w 1.3 x 10~ 12 , 
\\E 12 \\ 2 « 6.1 x 10- 13 , HE21H2 « 1.6 x 10- 12 , and \\E 22 \\ 2 « 9.3 x lO" 13 . The 
algorithm performs stably. 



Haar measure. Let X be a random 40-by-40 orthogonal matrix from Haar 
measure, let p — 18 and q — 15, and compute the CS decomposition of X 
using csd. Actually, it is impossible to sample exactly from Haar measure in 
floating-point, so define X by the following Matlab code. 

[X,R] = qr(randn(40)); 
X = X* diag ( sign (randn (40 , 1 ) ) ) ; 
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Over 1000 trials of our implementation, \\Tf[Ui -iislla, \\U^U 2 -I 22 \\ 2 , \\V?Vi- 
hth, \\V 2 T V 2 -I 25 \\ 2 , \\Enh, ||S 12 || 2 , ||£ 2 i|| 2 , and \\E 22 \\ 2 were all less than 2e, 
where e was defined to be the greater of ten times machine epsilon or ||X T X — 

Clusters of singular values. Let Si, S 2 , #3, . . . , 621 be independent and identi- 
cally distributed random variables each with the same distribution as 10 -18C/ ( 0,1 ) , 
in which C/(0, 1) is a random variable uniformly distributed on [0, 1]. For i = 

1,...,20, let 6i = f • ^r 1 r > and lct c = diag(cos(6>i), . . . , cos(6> 20 )), 5 = 
diag(sin(6 l i), . . . , sin(0 2O )), and 



X = 



' Ui 




c 


s ' 




' v 1 


u 2 




-s 


c 




v 2 



in which U\, U 2 , V\, and V 2 are random orthogonal matrices from Haar measure. 
(These matrices can be sampled as X was sampled in the previous test case.) 
The random matrix X is designed so that C and S have clustered singular values, 
as well as singular values close to and 1. Such singular values break the naive 
CSD algorithm. Compute the CSD of X with p = q = 20 using csd. Over 1000 
trials of our implementation, \\U^Ui - Ii&\\ 2 , \\U 2 U 2 — I 22 \\ 2 , HV^Vi — /15II2, 
\\V 2 T V 2 - 7 25 || 2 , ||£n||a, \\Ei 2 \\ 2l H-E21II2, and \\E 22 \\ 2 were all less than 3e, with 
e defined as in the previous test case. 

and <^°) chosen uniformly from [0, |] Choose 9\, . . . , 9 2 o and 4>\g 
independently and uniformly from the interval [0, ?1, and let 

r B u (e,<i>) Bx2(e,<j>) ' 

[ B 21 (M) B 22 {6,<p) ■ 

Compute the CSD of X with p = q = 20 using csd. Over 1000 trials, \\tf[Ui - 
hah, \\U 2 T V 2 -I 22 \\ 2 , UV^Vi — Jislla, \\V 2 T V 2 -I 25 \\ 2 , ||£ u || 2 , ||£ 12 || 2 , \\E 21 \\ 2 , 
and || £22 1| 2 were all less than 4e, with e defined as above. 

9^ and <//°) chosen randomly from {0, \,\ }• Repeat the previous test 
case, but with #i,...,0 2 o and <j)i, • . . ,<j)ig chosen uniformly from the three- 
element set {0> f j ? }• This produces test matrices with many zeros, which can 
tax the novel aspects of our extension of the Golub-Kahan-Demmel SVD step. 
Over 1000 trials, - I 18 \\ 2 , ||[/ 2 T (7 2 -7 22 || 2 , || V?Vi -Jib 1| 2 , \\V 2 T V 2 - 1 25 \\ 2 , 

II Jailh, H-E12II2, H-E21H2, and H-E22II2 were all less than e, with e defined as above. 

Acknowledgment. The thoughtful comments of an anonymous referee are ap- 
preciated. 



A Additional proofs 
A.l Proof of Lemma 13.161 

Proof. The proof is by induction on i. 
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That T\ = G\ is straightforward to prove. 
Assume by induction that Tj = Gj, j = 1, 
Km B = Sf^ ■ ■ ■ SfBTi ■ ■ ■ r„_i, and so 



i — 1. At line [6] of Algorithm 



B T B - a 2 I = TT 



■T T (B T B - cr 2 /)^ • • • Tj_i. 



At line Hot Algorithm [3J 



•Gf(A-AJ)Gi---Gi_i. 



By the induction hypothesis and the fact that B T B — <r 2 I = A — XI, we have 
(at the current step), 

B T B — cr 2 I = A. (A.l) 

We show Ti = d up to signs using three cases. 

Case 1: A(i : i + 1, i — 1) is nonzero. From (|A. 1|) . A(i : i + l,i — 1) = 
B(i—1, i—l)B(i—l, i : i+l) T . Hence, B{i— 1, i : i+l) T is nonzero, A(i : i+1, 
is colinear with B(i — l,i : i + 1) T , and = G{. 

Case 2: A(i : i + l,i — 1) and i?(i — l,i : i + 1) T egwa/ </ie zero vector. 
Then the Givens rotations Ti and Gi are based on A{i : i + and (\\B(: 
, i)|| 2 — tr 2 , i) T B(: } i + 1)) T , respectively, which are equal according to (|A.1[) . 
Hence, T 4 = G l . 

Case 5: A(i : i + 1, i — 1) is f/ie zero vector but B{i — 1, i : i + 1) is nonzero. 
Then _B must have the form 



a 6 

^ = j 
c a 



with a and b not both zero. Let's roll back the previous Givens rotation: 



(Sf^i) 1 B — Si-\B — 



x 
y z 

* 



Hence, 



A = B T B-a 2 I = {S l _ 1 B) T {S l ^ 1 B)-a 2 I = 



* * * 

* * 

„2 I ,,2 _2 



X + y — (7 J/2 
J/Z * * 
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Now, St-i = givens(i - l,i,{x 2 - a 2 ,xy) T ) = ±± 
which r = \\(x 2 — a 2 ,xy) T \\, so 



x 2 -a 2 



-xy 

2 2 

xy x — a 



B — Sf_ 1 S i ^iB — 



* * 

±f {x 2 +y 2 - a 2 ) ±fyz 
T ±f(-(7 2 ) i^a^-o 3 )* 



By assumption, B(i — l,i : i + 1) T is nonzero, and hence by inspection B(i — 



1) and A(z : i + l,i) are colinear. Therefore, Tj = G^. □ 



□ 



A. 2 Proof of Lemma 13.171 

Proof. Assume by induction that Sj = Gj for j = 1, . . . ,i 
Algorithm GS1 B = Sj^ ■ ■ ■ g[BT x ■■■T i , and so 



1. At line Q! of 



BB T - n 2 I = Sf_ x 



■S'[(BB T -<r 2 I)Si---Si-i. 



At line H of Algorithm E3 



A = Gf_j ■■■Gi(A- XI)Gx ■ ■ • Gi_i. 

By the induction hypothesis and the fact that BB T — a 2 1 = A — XI, we have 
(at the current step), 

BB T - a 2 I = A. (A.2) 

We show Si = d using three cases. 

Case 1: A{i : i + l,i — 1) is nonzero. From (|A.2j) . A(i : i + \,i - 1) = 
-B(i — 1, i)£?(i : i + Hence, B(i : i + 1, i) is nonzero, A(i : i + 1, i — 1) is 

colinear with B(« : i + 1, i), and Si = Gi. 

Case 2: A{i : i + 1, { — 1) and £?(i : i + 1, i) egtia/ t/ie zero vector. Then the 
Givens rotations Gi and are based on A(i : and (\\B(i, :) || 2 — a 2 , B(i, : 

+ 1, :)), respectively, which are equal according to (|A.2j) . Hence, Si = Gi. 

Case 3: A(i : i+ 1, i — 1) is </ie zero vector but B(i : i + 1, i) is nonzero. Then 
B must have the form 

* * 
* 
a c 
b d * 

* * 



B = 



with a and 6 not both zero. Let's roll back the previous Givens rotation: 



BTr 1 = BT! = 



* 

x y 
z * 

* 
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Hence, 



A = BB T -a 2 I = (BT?')(BT?y 1 -a z I 



* * 

x 2 + y 2 - a 2 yz 
yz * * 



Now, Ti = givens(i, i + 1, (x 2 - a 2 ,xy) T ) = ±± 
r = \\{x 2 — (T 2 , xy) T \\, so 



' — a —xy 
xy x z — a 



, in which 



B = BT/ T, 





±f(x 2 +y 2 -a 2 ) ± v -{-a 2 ) 

±-yz ±-(x 2 - a 2 )z * 



By assumption, B(i : i + 1, i) is nonzero, and hence by inspection B(i : i + 1, i) 
and A(i : i + are colinear. Therefore, Si — Gi. □ □ 
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